The purpose of this project is to gauge your technical skills and problem solving ability by working through something similar to a real NBA data science project. You will work your way through this R Markdown document, answering questions as you go along. Please begin by adding your name to the “author” key in the YAML header. When you’re finished with the document, come back and type your answers into the answer key at the top. Please leave all your work below and have your answers where indicated below as well. Please note that we will be reviewing your code so make it clear, concise, and avoid long printouts. Feel free to add in as many new code chunks as you’d like.
Remember that we will be grading the quality of your code and visuals
alongside the correctness of your answers. Please try to use the
tidyverse as much as possible (instead of base R and explicit loops).
Please do not bring in any outside data, and use the provided data as
truth (for example, some “home” games have been played at secondary
locations, including TOR’s entire 2020-21 season. These are not
reflected in the data and you do not need to account for this.) Note
that the OKC and DEN 2024-25 schedules in
schedule_24_partial.csv intentionally include only 80
games, as the league holds 2 games out for each team in the middle of
December due to unknown NBA Cup matchups. Do not assign specific games
to fill those two slots.
Note:
Throughout this document, any season column
represents the year each season started. For example, the 2015-16 season
will be in the dataset as 2015. We may refer to a season by just this
number (e.g. 2015) instead of the full text (e.g. 2015-16).
Question 1: 26 4-in-6 stretches in OKC’s draft schedule.
Question 2: 25.1 4-in-6 stretches on average.
Question 3:
Question 4: This is a written question. Please leave your response in the document under Question 4.
Question 5:
Please show your work in the document, you don’t need anything here.
Question 9:
library(tidyverse)
# Note, you will likely have to change these paths. If your data is in the same folder as this project,
# the paths will likely be fixed for you by deleting ../../Data/schedule_project/ from each string.
schedule <- read_csv("~/Downloads/schedule.csv")
draft_schedule <- read_csv("~/Downloads/schedule_24_partial.csv")
locations <- read_csv("~/Downloads/locations.csv")
game_data <- read_csv("~/Downloads/team_game_data.csv")
In this section, you’re going to work to answer questions using NBA scheduling data.
QUESTION: How many times are the Thunder scheduled to play 4 games in 6 nights in the provided 80-game draft of the 2024-25 season schedule? (Note: clarification, the stretches can overlap, the question is really “How many games are the 4th game played over the past 6 nights?”)
# Here and for all future questions, feel free to add as many code chunks as you like. Do NOT put echo = F though, we'll want to see your code.
library(tidyverse)
draft_schedule_OKC <- draft_schedule %>%
filter(team == "OKC")
count_4in6<- function(data, date_col){
data %>%
# Standarize the data
transmute(gamedate = as.Date({{date_col}})) %>%
arrange(gamedate) %>%
mutate(n_games = purrr::map_int(gamedate, ~ sum(gamedate >= .x & gamedate <= .x+5))) %>%
filter(n_games >= 4) %>%
mutate(window_end = gamedate + 5) %>%
select(window_start = gamedate, window_end, n_games)
}
res <- count_4in6(draft_schedule_OKC, gamedate)
res
## # A tibble: 26 × 3
## window_start window_end n_games
## <date> <date> <int>
## 1 2024-10-30 2024-11-04 4
## 2 2024-11-01 2024-11-06 4
## 3 2024-11-06 2024-11-11 4
## 4 2024-11-08 2024-11-13 4
## 5 2024-11-10 2024-11-15 4
## 6 2024-11-15 2024-11-20 4
## 7 2024-12-26 2024-12-31 4
## 8 2024-12-28 2025-01-02 4
## 9 2024-12-29 2025-01-03 4
## 10 2024-12-31 2025-01-05 4
## # ℹ 16 more rows
nrow(res)
## [1] 26
ANSWER 1:
26 4-in-6 stretches in OKC’s draft schedule.
QUESTION: From 2014-15 to 2023-24, what is the average number of 4-in-6 stretches for a team in a season? Adjust each team/season to per-82 games before taking your final average.
by_team_season <- schedule %>%
filter(season >= 2014, season <= 2023) %>%
group_by(team,season) %>%
summarise(games_played = n(),
four_in_six = nrow(count_4in6(pick(gamedate), gamedate)),
.groups = "drop"
) %>%
mutate(four_in_six_per82 = four_in_six * 82 / games_played)
overall_avg_q2 <- by_team_season %>%
summarise(avg_4in6_per82 = mean(four_in_six_per82, na.rm = TRUE))
overall_avg_q2
## # A tibble: 1 × 1
## avg_4in6_per82
## <dbl>
## 1 25.1
ANSWER 2:
25.1 4-in-6 stretches on average.
QUESTION: Which of the 30 NBA teams has had the highest average number of 4-in-6 stretches between 2014-15 and 2023-24? Which team has had the lowest average? Adjust each team/season to per-82 games.
by_team_season <- schedule %>%
filter(season >= 2014, season <= 2023) %>%
group_by(team,season) %>%
summarise(games_played = n(),
four_in_six = nrow(count_4in6(pick(gamedate), gamedate)),
.groups = "drop") %>%
mutate(four_in_six_per82 = four_in_six * 82 / games_played)
#Avg per team across seasons
team_avgs <- by_team_season %>%
group_by(team) %>%
summarise(avg_4in6_per82 = mean(four_in_six_per82, na.rm = TRUE)) %>%
arrange(desc(avg_4in6_per82))
#Identify highest and lowest
highest <- team_avgs %>% slice(1)
lowest <- team_avgs %>% slice(n())
highest
## # A tibble: 1 × 2
## team avg_4in6_per82
## <chr> <dbl>
## 1 CHA 28.1
lowest
## # A tibble: 1 × 2
## team avg_4in6_per82
## <chr> <dbl>
## 1 NYK 22.2
ANSWER 3:
QUESTION: Is the difference between most and least from Q3 surprising, or do you expect that size difference is likely to be the result of chance?
The difference between the most CHA (28.1) and least NYK(22.2) is nearly 6 stretches per 82 games. This difference is likely the result of chance, since it is fairly small compared to the amount of teams in the league and the 10 seasons considered.
ANSWER 4:
QUESTION: What was BKN’s defensive eFG% in the 2023-24 season? What was their defensive eFG% that season in situations where their opponent was on the second night of back-to-back?
bkn_games <- game_data %>%
mutate(gamedate = as.Date(gamedate)) %>%
filter(def_team == "BKN" , season == 2023) %>%
mutate(opp_fgm = fg2made + fg3made,
opp_fga = fg2attempted + fg3attempted,
opponent_efg_pct = (opp_fgm + 0.5 * fg3made) / opp_fga)
#Overall defensive eFG%
bkn_def_efg <- mean(bkn_games$opponent_efg_pct, na.rm = TRUE)
#Add B2B info
opp_b2b <- schedule %>%
mutate(gamedate = as.Date(gamedate)) %>%
filter(season == 2023) %>%
arrange(team, gamedate) %>%
group_by(team) %>%
mutate(prev_game = lag(gamedate), b2b = (gamedate - prev_game == 1)) %>%
ungroup() %>%
select(team, gamedate, b2b)
bkn_games_b2b <- bkn_games %>%
left_join(opp_b2b %>% select(team, gamedate, b2b), by = c("off_team" = "team", "gamedate" = "gamedate"))
#Defensive eFG% when opponent on B2B
bkn_def_efg_b2b <- mean(bkn_games_b2b$opponent_efg_pct[bkn_games_b2b$b2b == TRUE], na.rm = TRUE)
#Round
round(bkn_def_efg * 100,1)
## [1] 54.5
round(bkn_def_efg_b2b * 100, 1)
## [1] 53.6
ANSWER 5:
This is an intentionally open ended section, and there are multiple approaches you could take to have a successful project. Feel free to be creative. However, for this section, please consider only the density of games and travel schedule, not the relative on-court strength of different teams.
QUESTION: Please identify at least 2 trends in scheduling over time. In other words, how are the more recent schedules different from the schedules of the past? Please include a visual (plot or styled table) highlighting or explaining each trend and include a brief written description of your findings.
#B2B Trends
b2b_trend <- schedule %>%
mutate(gamedate = as.Date(gamedate)) %>%
arrange(team, gamedate) %>%
group_by(team, season) %>%
mutate(prev_game = lag(gamedate), b2b = (gamedate - prev_game == 1)) %>%
summarise(b2b_total = sum(b2b, na.rm = TRUE), .groups = "drop") %>% group_by(season) %>%
summarise(avg_b2b_per_team = mean(b2b_total), .groups = "drop")
#Visualization
library(ggplot2)
ggplot(b2b_trend, aes( x = season, y = avg_b2b_per_team)) +
geom_line() +
geom_point() +
labs(
title = "Average Back to Backs per Team by Season",
x = "Season",
y = "Avg B2B per Team"
) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))
road_trips_team_season <- schedule %>%
mutate(gamedate = as.Date(gamedate)) %>%
arrange(team, gamedate) %>%
group_by(team, season) %>%
#Flag Away Games and ID each trip
mutate(
away = (home == 0) ,
prev_away = c(FALSE, away[-length(away)]) ,
new_trip = away & (!prev_away) ,
trip_id = cumsum(new_trip)
) %>%
#Filter Away Games
filter(away) %>%
group_by(team, season, trip_id) %>%
#Calculate Length and Count 3+ Stretches
summarise(trip_len = n(), .groups = "drop") %>%
group_by(team, season) %>%
summarise(n_3plus_road_trips = sum(trip_len >= 3), .groups = "drop")
#Visualization
road_trip_trend <- road_trips_team_season %>%
group_by(season) %>%
summarise(avg_3plus_road_trips = mean(n_3plus_road_trips), .groups = "drop")
ggplot(road_trip_trend, aes(x = season, y = avg_3plus_road_trips)) +
geom_line() +
geom_point() +
labs(title = "Average 3+ Road Game Stretches by Team per Season",
x = "Season",
y = "Avg 3+ Road Game Stretches per Team") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5)) +
scale_y_continuous(limits = c(0,10)) #Widen scale
ANSWER 6:
From 2014-15 to 2023-24, the average number of back to back games per team has steadily declined. On average teams were player around 18 to 19 back to back per season in early years, but that number has dropped to around 14 in recent years. It seems the league reduced the amount of back to backs played each year. This can help reduce schedule congestion and prioritize player health.
Looking at road stretches with 3 or more games, the trend has been relatively stable, averaging between 5 and 7 per team per season. While there is some year to year variation, the long term pattern is consistent. The league appears to balance these trips for travel efficiency.
QUESTION: Please design a plotting tool to help visualize a team’s schedule for a season. The plot should cover the whole season and should help the viewer contextualize and understand a team’s schedule, potentially highlighting periods of excessive travel, dense blocks of games, or other schedule anomalies. If you can, making the plots interactive (for example through the plotly package) is a bonus.
Please use this tool to plot OKC and DEN’s provided 80-game 2024-25 schedules.
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
#Filter
okc_den <- draft_schedule %>%
filter(team == "OKC" | team == "DEN") %>%
mutate(gamedate = as.Date(gamedate)) %>%
arrange(team, gamedate) %>%
group_by(team) %>%
#Track Previous Game Date
mutate(
prev_date = c(NA, gamedate[-length(gamedate)]),
#Track Next Game Date
next_date = c(gamedate[-1], NA),
#B2B
b2b_any = (!is.na(prev_date) & (gamedate-prev_date == 1))|
(!is.na(next_date) & (next_date - gamedate == 1)),
#Home or Away
home_fac = ifelse(home == 1, "Home", "Away"),
#Hover for Plotly
hover_txt = paste0(
"Team: ", team,
"<br>Date:", gamedate,
"<br>Opponent: ", opponent,
"<br>Venue: ", home_fac,
"<br>B2B: ", ifelse(b2b_any, "Yes", "No")
)
) %>%
ungroup()
p <- ggplot(okc_den, aes(x = gamedate,
y = 1,
color = home_fac,
shape = b2b_any,
text = hover_txt)) +
geom_point( size = 3) +
scale_shape_manual(values = c(`FALSE` = 16, `TRUE` = 17), name = "Back to Back") +
scale_y_continuous(NULL, breaks = NULL) +
labs(
title = "2024-25 OKC/DEN Schedule",
x = "Date",
y = NULL,
color = "Venue",
) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5),
panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank()) +
facet_wrap(~ team, ncol = 1, scales = "free_y")
gp <- ggplotly(p, tooltip = "text") %>%
layout(
legend = list(orientation = "h", y = -0.15),
xaxis = list(rangeslider = list(visible = TRUE))
)
gp
ANSWER 7:
QUESTION: Using your tool, what is the best and worst part of OKC’s 2024-25 draft schedule? Please give your answer as a short brief to members of the front office and coaching staff to set expectations going into the season. You can include context from past schedules.
ANSWER 8:
The best part of OKC’s 2024-25 draft schedule is from 03/27/25 to 04/02/25, when the team plays 4 straight home games with no back to back games during the stretch. This gives the opportunity to get valuable rest and gear up for a playoff run.
The toughest stretch occurs from 02/21 to 03/03, where OKC plays 7 games in 11 days, including two different back to backs Feb(23-24) and Mar(2-3). This period has disruptive travel with 5 away games and the only 2 home games coming on the back end of back to backs. This creates a significant challenge to catch any rhythm or gain any rest.
QUESTION: Please estimate how many more/fewer regular season wins each team has had due to schedule-related factors from 2019-20 though 2023-24. Your final answer should have one number for each team, representing the total number of wins (not per 82, and not a per-season average). You may consider the on-court strength of the scheduled opponents as well as the impact of travel/schedule density. Please include the teams and estimates for the most helped and most hurt in the answer key.
If you fit a model to help answer this question, please write a paragraph explaining your model, and include a simple model diagnostic (eg a printed summary of a regression, a variable importance plot, etc).
sched_feat <- schedule %>%
mutate(gamedate = as.Date(gamedate),
away = (home == 0)) %>%
filter(season >= 2019, season <= 2023) %>%
arrange(team, gamedate) %>%
group_by(team, season) %>%
mutate(
prev_date = lag(gamedate),
next_date = lead(gamedate),
#B2B
b2b_any = (!is.na(prev_date) & (gamedate - prev_date == 1)) |
(!is.na(next_date) & (next_date - gamedate == 1)),
#Count 4in6
n_in_6 = purrr::map_int(gamedate, \(d) sum(gamedate >= d & gamedate <= d + 5)),
four_in_six = (n_in_6 >= 4),
#Road Trips
prev_away = lag(away, default = FALSE),
new_trip = away & !prev_away,
trip_id = ifelse(away,cumsum(new_trip), NA_integer_)
) %>%
group_by(team, season, trip_id) %>%
mutate(road_trip_index = ifelse(away, row_number(), 0L)) %>%
ungroup() %>%
mutate(
b2b = as.integer(b2b_any),
four_in_six = as.integer(four_in_six),
road_trip_index = replace_na(road_trip_index, 0L),
home = as.integer(home),
win = as.integer(win)
) %>%
select(season, team, gamedate, home, away, win, b2b, four_in_six, road_trip_index)
#Fit model
sched_scored <- sched_feat %>%
filter(season >= 2019, season <= 2023)
sched_model <- glm(
win ~ b2b + four_in_six + road_trip_index + home,
data = sched_scored,
family = binomial()
)
summary(sched_model)
##
## Call:
## glm(formula = win ~ b2b + four_in_six + road_trip_index + home,
## family = binomial(), data = sched_scored)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.19060 0.05406 -3.526 0.000422 ***
## b2b -0.14948 0.03956 -3.778 0.000158 ***
## four_in_six 0.03421 0.04083 0.838 0.402086
## road_trip_index 0.01202 0.02052 0.586 0.558168
## home 0.43869 0.05686 7.715 1.21e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 16161 on 11657 degrees of freedom
## Residual deviance: 16017 on 11653 degrees of freedom
## AIC: 16027
##
## Number of Fisher Scoring iterations: 3
sched_scored$pred_win_prob <- predict(sched_model, newdata = sched_scored, type = "response")
sched_scored <- sched_scored %>%
mutate(expected_wins = ifelse(pred_win_prob > 0.50, 1, 0))
#Actual vs Expected
team_summary <- sched_scored %>%
group_by(team) %>%
summarise(
total_actual_wins = sum(win, na.rm = TRUE),
total_expected_wins = sum(expected_wins, na.rm = TRUE),
diff = total_actual_wins - total_expected_wins,
.groups = "drop"
) %>%
arrange(diff)
most_hurt <- slice_head(team_summary, n = 1)
most_helped <- slice_tail(team_summary, n = 1)
most_helped
## # A tibble: 1 × 4
## team total_actual_wins total_expected_wins diff
## <chr> <int> <dbl> <dbl>
## 1 MIL 260 195 65
most_hurt
## # A tibble: 1 × 4
## team total_actual_wins total_expected_wins diff
## <chr> <int> <dbl> <dbl>
## 1 DET 94 191 -97
ANSWER 9:
I fit a logistic regression model on games from 2019 to 2023 with wins as the outcome and schedule features as predictors. The estimates indicate that back to backs are influential in reducing win probability, while home court advantage increases a team’s chances to win. The four in six and road trip indicators are small so they are not significant to the outcome. Overall, the model points to venue and back to back density as the schedule factors with clear impact.